Skip to content

First draft of Korean Cardinal ITN#281

Closed
hmlee245 wants to merge 21 commits intoNVIDIA:mainfrom
hmlee245:Draft/-Korean-Cardinal-ITN
Closed

First draft of Korean Cardinal ITN#281
hmlee245 wants to merge 21 commits intoNVIDIA:mainfrom
hmlee245:Draft/-Korean-Cardinal-ITN

Conversation

@hmlee245
Copy link
Copy Markdown

Sparrowhawk testing is not done yet.

What does this PR do ?

This draft PR is for checking the tagger ITN cardinal code and other files.

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test
    Draft
    If you haven't finished some of the above items you can still open "Draft" PR.

@@ -0,0 +1,11 @@
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on how you implemented the rules, do we need this file?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept this file for tracking all the units for Korean cardinal, so yes I can delete it. But, for my cardinal taggers, I only expressed the number until the digit of ten quintillion(천경). I was wondering if I should go higher than that. If yes, I can write the code quickly, since it just all loops.

### Responsible for the first digit of number. ex) 1,2,3,4,5,,,
graph_ten_component += graph_digit | pynutil.insert("0")

hundred = pynutil.delete("백")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's have tests for each and every one of these blocks, including "delete" and "cross" the 백

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested by commenting out each lines and I do need both lines to have the test result that I wanted. "delete" line is specifically for hundredth digit that is not "1". (For ex. 200, 300, 400) "cross" line is specifically for if hundredth digit is 1. (100). Let me know if you meant something else by "testing" it.

leading_zero = (
pynutil.delete(pynini.closure("0")) + pynini.difference(NEMO_DIGIT, "0") + pynini.closure(NEMO_DIGIT)
)
graph_nonzero = graph @ leading_zero
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add tests for this

graph_nonzero = graph @ leading_zero
graph = pynini.union(graph_nonzero, graph_zero)

graph = graph @ leading_zero | graph_zero
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is graph @ leading_zero replaceable with graph_nonzero?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(let's also add tests)


final_graph = (
optional_sign + pynutil.insert(" ") + pynutil.insert("integer: \"") + graph + pynutil.insert("\"")
) | (pynutil.insert("integer: \"") + graph + pynutil.insert("\""))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the sign is already optional, do we need this line? or is it enough to also make the first space part of the optional sign as well?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I still need it for showing result on terminal. Optional sign could be replaced with graph_negative if I redefine properly...?
Realized I am not using graph_negative as I am replying to this comment. Deleted graph_negative and one more line related to it(line 39, 40)

pynutil.delete("negative:")
+ delete_space
+ pynutil.delete("\"")
+ pynini.accep("-")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could also make it "negative": "true" in the tagger and do pynini.cross("\"true\"") here -- check how English does it

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't see it on ITN part of English, but if you are referring to TN part of English, I do see it. Do you want me to change it in that format?

def __init__(self):
super().__init__(name="cardinal", kind="classify")

graph_zero = pynini.string_file(get_abs_path("data/numbers/zero.tsv"))

Check warning

Code scanning / CodeQL

Variable defined multiple times

This assignment to 'graph_zero' is unnecessary as it is [redefined](1) before this value is used.
Comment on lines +22 to +30
from nemo_text_processing.inverse_text_normalization.ko.graph_utils import (
INPUT_LOWER_CASED,
GraphFst,
delete_extra_space,
delete_space,
generator_main,
)

Check notice

Code scanning / CodeQL

Unused import

Import of 'delete_extra_space' is not used. Import of 'delete_space' is not used.

from nemo_text_processing.inverse_text_normalization.ko.graph_utils import GraphFst, delete_space, generator_main
from nemo_text_processing.inverse_text_normalization.ko.verbalizers.verbalize import VerbalizeFst
from nemo_text_processing.inverse_text_normalization.ko.verbalizers.word import WordFst

Check notice

Code scanning / CodeQL

Unused import

Import of 'WordFst' is not used.
# limitations under the License.


import pynini

Check notice

Code scanning / CodeQL

Unused import

Import of 'pynini' is not used.
import pynini
from pynini.lib import pynutil

from nemo_text_processing.inverse_text_normalization.ko.graph_utils import NEMO_NOT_QUOTE, GraphFst, delete_space

Check notice

Code scanning / CodeQL

Unused import

Import of 'delete_space' is not used.
from parameterized import parameterized

from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer
from nemo_text_processing.text_normalization.normalize import Normalizer

Check notice

Code scanning / CodeQL

Unused import

Import of 'Normalizer' is not used.
Sparrowhawk testing is not done yet.

Signed-off-by: hmlee245 <hmlee245@gmail.com>
@hmlee245 hmlee245 force-pushed the Draft/-Korean-Cardinal-ITN branch from d8e057d to fa304a2 Compare May 13, 2025 22:03
@hmlee245 hmlee245 force-pushed the Draft/-Korean-Cardinal-ITN branch from a2996ba to 2c832f5 Compare May 13, 2025 22:18
hmlee245 and others added 9 commits May 13, 2025 15:19
…245/NeMo-text-processing into Draft/-Korean-Cardinal-ITN

Signed-off-by: hmlee245 <hmlee245@gmail.com>
…245/NeMo-text-processing into Draft/-Korean-Cardinal-ITN

Signed-off-by: hmlee245 <hmlee245@gmail.com>
Fixes issue with sparrowhawk builds as the original base image is no longer maintained and build breaks

Signed-off-by: anand-nv <105917641+anand-nv@users.noreply.github.com>
Signed-off-by: hmlee245 <hmlee245@gmail.com>
* Future implementations to date.py - Hindi ITN  (NVIDIA#265)

* Addition of whitelist and word classes

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updation of Jenkins date

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* Cleanup

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* Updation

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* Updation

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* Future implementations for date

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* pushing rough date code for ref

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* Future implementations date.py

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* Cleanup

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updation of Jenkinsfile

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* Telephone.py-hindi itn

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Telephone.py - Hindi ITN

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Telephone modified tagger and verbalizer

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* telephone tagger with 3,4,5 digit std codes

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* Further additions - telephone.py

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Jenkins update

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* Telephone.py

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updated tagger-telephone.py

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* Telephone and Jenkinsfile cleanup

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update Jenkins

Signed-off-by: Tarushi V <tarushiv@nvidia.com>

---------

Signed-off-by: Tarushi V <tarushiv@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Anand Joseph <anajoseph@nvidia.com>

* Add missing __init__.py file

Signed-off-by: Anand Joseph <anajoseph@nvidia.com>

---------

Signed-off-by: Tarushi V <tarushiv@nvidia.com>
Signed-off-by: Anand Joseph <anajoseph@nvidia.com>
Co-authored-by: tarushi2k2 <tarushiv@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: hmlee245 <hmlee245@gmail.com>
* add base coverage for fr tn date

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: hmlee245 <hmlee245@gmail.com>
* [pre-commit.ci] pre-commit suggestions

updates:
- [github.com/pre-commit/pre-commit-hooks: v4.3.0 → v5.0.0](pre-commit/pre-commit-hooks@v4.3.0...v5.0.0)
- [github.com/PyCQA/flake8: 7.1.1 → 7.2.0](PyCQA/flake8@7.1.1...7.2.0)
- [github.com/PyCQA/isort: 5.12.0 → 6.0.1](PyCQA/isort@5.12.0...6.0.1)
- [github.com/psf/black: 19.10b0 → 25.1.0](psf/black@19.10b0...25.1.0)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: hmlee245 <hmlee245@gmail.com>
Sparrowhawk testing is not done yet.

Signed-off-by: hmlee245 <hmlee245@gmail.com>
hmlee245 and others added 5 commits May 13, 2025 15:38
Haven't fixed the graph@leading zero part

Signed-off-by: hmlee245 <hmlee245@gmail.com>
for more information, see https://pre-commit.ci

Signed-off-by: hmlee245 <hmlee245@gmail.com>
Signed-off-by: hmlee245 <hmlee245@gmail.com>
for more information, see https://pre-commit.ci

Signed-off-by: hmlee245 <hmlee245@gmail.com>
@hmlee245 hmlee245 closed this May 16, 2025
@hmlee245 hmlee245 mentioned this pull request May 16, 2025
14 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants